Destroyed pool fix: prevent SIGSEGV on IOC exit when pvAccess holds NDArrays after driver/pool destroyed #570
kgofron wants to merge 3 commits into areaDetector:master from
|
Hi Kaz! Have you seen #572? It would be nice to determine how these two interact, since you're currently registering your exit handler manually, and you could take advantage of |
|
You need the latest asyn, and the ADCore from the PR that Erico linked above. Then, follow these guidelines |
destructible-drivers
I compiled with the destructible-drivers branch, but unfortunately exit after acquisition still results in a segmentation fault. Perhaps I missed something.
Summary
The destructible-drivers branch (ADCore PR 572) fixes the shutdown order (asyn calls shutdownPortDriver() and then deletes the driver), but it does not include the pool-safety fix from ADCore PR 570. So the crash you see is still the same: after the driver (and its NDArrayPool) are destroyed, pvAccess (PVA) can later call NDArray::release() on arrays that belonged to that pool → use-after-free → SIGSEGV. So ADTimePix3 needs both:
What to do
Apply the ADCore PR 570 (destroyed-pool) changes on top of your current destructible-drivers ADCore. That PR adds:
Ways to get that into your tree:
After ADCore has both destructible-drivers and the pool-safety logic, exit should no longer segfault. If you paste your ADCore branch/commit and the PR 570 patch or link, I can outline exact merge/cherry-pick steps or a minimal patch for your tree. |
Stack trace:
“SIGSEGV on exit with ADCore master. Backtrace shows crash in NDArrayPool::release (ADCore) called from PVA teardown (freeNDArray → NDArray::release) after the driver and its pool are already destroyed. Full bt full and info sharedlibrary attached.”
Attached: The full bt full output (and optionally info sharedlibrary) from your tpx3_SIGSEGV.md file, or the relevant frames (#0–#2, #37, #49, #65–#66, #71–#72, #79–#81) to see the PVA → NDArray release → pool release path and the exit-handler order.
GDB analysis – SIGSEGV on exit (ADCore master)
Build: ADTimePix3 IOC, run with ADCore current master (no PR 570, no destructible PR 572 in this run).
Where it crashes
#0 – NDArrayPool::release(this=0x555556ccc870, pArray=0x7fff3c001a00) at NDArrayPool.cpp:373
So the fault is inside the pool’s release() (use-after-free on the pool or its internals).
Call chain (who called into the pool)
#81–79 – User types exit → epicsExit(0) → C library runs atexit handlers.
So: PVA is shutting down (atexit), destroying MonitorElements that still hold NDArray-backed data. Their deleter calls NDArray::release(), which calls NDArrayPool::release() on a pool that no longer exists.
Why the pool is gone
Shutdown order is:
Conclusion
Root cause: Use-after-free in ADCore: NDArray::release() is called from PVA’s deleter after the driver’s NDArrayPool has already been destroyed. The crash is in NDArrayPool::release (ADCore), not in ADTimePix3.
tpx3_SIGSEGV.md
Why fix ADCore
So the use-after-free is between PVA’s lifetime and ADCore’s pool, not inside ADTimePix3.
What ADTimePix3 can and cannot do
|
|
Strange, I've been working with pvxs lately and I thought that the server was stopped before the epicsAtExit hooks are run. Is it possible that the NDPluginPvxs is still holding SharedPVs referencing NDArrays? Is it destroyed at all? In that case, it seems fixing that would be the right thing to do.
Then again I don't know what happens when NDPluginPva is used. And I recently learned that NDPlugin keeps a reference to the last NDArray, which keeps hanging around unless the NDPlugin is destroyed. So, maybe fixing this in the pool is not a bad thing.
The reason why it seems a bit icky to me is that it looks like a band-aid for a deeper ownership problem. But I also realize that changing how ownership works when the pool class is used all over the place and referenced using raw pointers is probably not going to happen.
Sorry, I didn't have any time for this in the past few weeks, so I'm just throwing ideas out there. Ignore if stupid :) |
|
I suppose making an ADCore R4-* that leverages C++ smart pointers for this and other things (thus requiring C++11 and adjustments to all drivers) would probably require too much work to be worthwhile.
Segmentation fault
"fix: prevent SIGSEGV on IOC exit when pvAccess holds NDArrays after driver/pool destroyed" refers to a segmentation fault after the IOC exits, when an acquisition was performed (memory/pool allocated).
Fix applied to ADCore 3.14.0 master.
Problem
When an IOC exits (e.g. user types exit) after acquisition has run, the process can hit a SIGSEGV (signal 11). The crash is in NDArrayPool::release() (or equivalent use of the pool) after the detector driver and its NDArrayPool have already been destroyed.
Cause: shutdown order. The detector driver destructor runs and deletes pNDArrayPoolPvt_. Later, the pvAccess ServerContext is torn down (atexit). Its MonitorElements still hold NDArray-derived data. The deleter used by ntndArrayConverter (freeNDArray) calls NDArray::release() on those arrays. By then the pool is gone, so release() runs against freed memory → SIGSEGV.
This has been seen with areaDetector IOCs (e.g. ADTimePix3) using ADCore 3.12.1 and 3.14.0. See issue areaDetector/ADTimePix3#5.
Approach
Two parts:
“Destroyed pool” registry
So any late release() (from PVA or elsewhere) no-ops safely, even for NDArrays that are not the driver’s pArrays[] (e.g. copies handed to PVA).
asynNDArrayDriver destructor
Changes
ADCore314_fix.md
References